You've been given a dataset of about 500 malware binary files that were found on your organization's computers. Whenever you find more malware, you want to be able to tell whether you've seen a file like it before.

Binary files are hard to understand. After code is written, it goes through several more steps before it becomes runnable software. Two of these steps are:

i. Compiling, which turns human-readable source code into assembly code. Assembly code is difficult for humans to read, but it closely mirrors the basic instructions that a computer needs in order to run a program.
ii. Assembling, which turns assembly code into machine code. Machine code is practically unreadable for humans, but it is the representation that a computer actually executes.

The malware binary files you were given are all in machine code, but luckily, you were able to run a program called a disassembler to turn them back into assembly code.

Assembly code contains *instructions*, which tell a computer how to update its own internal memory and how to progress through the assembly code itself. For instance, the `jmp` instruction means "jump to executing a different instruction", and the `add` instruction means "add two numbers and store the result in memory".

Your dataset contains data about each malware file, including its file hash, which serves as a name, and its counts of `jmp` and `add` instructions.

Malware authors often release many slightly different versions of the same malware over time. Because even a small change to a file produces a completely different hash, these versions cannot be matched by hash, but they are likely to have similar numbers of `jmp` and `add` instructions.
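
To make the idea of "similar instruction counts" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the dataset's actual format or a prescribed method. It assumes disassembly text in which each line starts with the instruction mnemonic (real disassembler output often has address or byte columns first), and it uses plain Euclidean distance between `(jmp, add)` count pairs as one possible similarity measure. The function names, the hashes `hash_a`, `hash_b`, `hash_c`, and all counts are hypothetical.

```python
import math
from collections import Counter


def count_instructions(assembly_text, mnemonics=("jmp", "add")):
    """Count how often each mnemonic appears as the first token of a line.

    Assumes one instruction per line, with the mnemonic first.
    Returns the counts as a tuple in the same order as `mnemonics`.
    """
    counts = Counter()
    for line in assembly_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0].lower() in mnemonics:
            counts[tokens[0].lower()] += 1
    return tuple(counts[m] for m in mnemonics)


def closest_known_file(known, new_counts):
    """Return (hash, distance) for the known file nearest to new_counts.

    `known` maps a file hash to its (jmp_count, add_count) pair.
    """
    best_hash = min(known, key=lambda h: math.dist(known[h], new_counts))
    return best_hash, math.dist(known[best_hash], new_counts)


# Hypothetical entries standing in for rows of the malware dataset.
known = {
    "hash_a": (120, 340),
    "hash_b": (125, 338),
    "hash_c": (900, 50),
}

# A tiny made-up disassembly snippet.
sample = """\
mov eax, 1
add eax, 2
jmp done
add ebx, eax
"""

sample_counts = count_instructions(sample)
nearest_hash, distance = closest_known_file(known, (123, 339))
```

A small distance suggests the new file may be a variant of a known one, even though its hash differs. How small counts as "similar" is a judgment call that depends on the data.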